C/C++ Includes as graphs

Intro

At some point of time you want to see the dependencies inside of your project.
This can be useful for optimization of compilation times, and also from design perspective.
One of the nice representation is graph where vertexes is files and edges is include statements.
This representation is very powerful for analysis because it gives us the whole graph theory tricks to use.
I have used it to find which files it is better and safe to include into precompilation header, and also to gain some nice overview of the project.

Technologies used

I have ended with python + orientdb + gephi.

My first try for our relatively big project was python + graphviz.
I have failed with graphviz because it was trying to produce a png for 4 hours(4k vertexes + 30k edges).

The second attempt was with graph database.
I have decided that for analysis purposes it is a way better to load everything in database which allows you to do custom queries.
So I took orientdb - installation very straightforward and the documentation is nice.

I will use visualization of CPython project here.

Short OrientDB intro

OrientDB allows you to do SQL like queries over its database and also have a beautiful one web client interface.
screenshot of orientdb web interface

It is document oriented db with possibility to store graphs. It can be schema-less, or you can create a scheme and use it.
Classes is almost like tables in regular one SQL(actually it is not so, but for simplicity reasons lets assume it is so).

It also have a nice one python bindings.

Putting it all together

Source code

You can obtain source code from bitbucket
https://bitbucket.org/dehun/cpp_dependencies/src

Or you can read it below

import re
import os
import fileinput
import multiprocessing
import pyorient


def get_file_list():
    return [f.strip() for f in fileinput.input()]


def disambiguate_filename(path):
    return os.path.basename(path)


def disambiguate_dependency(dep):
    return (disambiguate_filename(dep[0]),
            disambiguate_filename(dep[1]))


def disambiguate_dependencies(deps):
    return [disambiguate_dependency(dep) for dep in deps]


def read_dependencies_from_file(filepath):
    print '[i] reading dependencies from file ' + filepath
    with open(filepath) as fin:
        return [(filepath, dep) for dep in
                re.findall(r'#include ["](.+?(?:h|hpp))["]', fin.read())]


def read_dependencies_from_files(file_list):
    # read raw dependencies
    dependencies = []
    pool = multiprocessing.Pool(100)
    results = pool.map(read_dependencies_from_file, file_list)
    dependencies = [item for sublist in results for item in sublist]

    # disambiguate
    dependencies = disambiguate_dependencies(dependencies)
    # count em
    counted_deps = set()
    for left_dep in dependencies:
        count = len([d for d in dependencies if d == left_dep])
        counted_deps.add((left_dep[0], left_dep[1], count))
    print 'got %s dependencies' % (len(counted_deps))

    return counted_deps


def save_graph(deps):
    print 'saving graph'
    vertices = set()
    for d in deps:
        vertices.add(d[0])
        vertices.add(d[1])

    edges = deps

    orient = pyorient.OrientDB()
    orient.connect("root", "system")
#    orient.db_delete("cppdepdb")
    orient.db_create("cppdepdb", pyorient.DB_TYPE_GRAPH)
    orient.db_open("cppdepdb", "root", "system")
    orient.command("create class cppfile extends V")
    orient.command("create class cppinclude extends E")
    for v in vertices:
        orient.command("create vertex cppfile set name='%s'" % v)

    for edge in edges:
        orient.command("create edge cppinclude from"\
                       "(select from cppfile where name='%s') to"\
                       "(select from cppfile where name='%s') set count='%s'"
                       % (edge[0], edge[1], edge[2]))

    orient.db_close()


def print_leaderboard(deps):
    vertices_ins = {}
    for dep in deps:
        if dep[1] not in vertices_ins:
            vertices_ins[dep[1]] = 0
        vertices_ins[dep[1]] += dep[2]

    vertices_s = sorted(vertices_ins.items(),
                        key=lambda x: x[1])
    for v in vertices_s:
        print v


def main():
    files_list = get_file_list()
    deps = read_dependencies_from_files(files_list)
    print_leaderboard(deps)
    save_graph(deps)


if __name__ == '__main__':
    main()

Usage

The script accepts list of files in stdin.

find /path/to/your/project -name "*.h" > my_project_files.list
find /path/to/your/project -name "*.cpp" > my_project_files.list
cat my_project_files.list | python cpp_dependencies.py

Some notes

Playing with the orientdb

Go to web interface and choose graph.
Now we can make a traverse from a file to see who is including this file and how.

TRAVERSE in(), inE() FROM (SELECT * FROM cppfile WHERE name='util.h') WHILE $depth < 10000 LIMIT 10000

And you will get a nice graph dependency for "util.h".
But this can be very slow in case of many nodes.
To display file names instead of ids just click on any node, click on eye icon(show), go to configure and choose name.

Visualization

The orientdb web client is not good enough when it comes to visualization of big one graphs.
Here you will need something more powerful. I have searched across the internet and the most suitable project I found was Gephi.
http://gephi.github.io/

To connect gephi to orientdb we will need to install streaming plugin.
Please see instructions in detail here - https://code.google.com/p/orient/wiki/Gephi

For full includes whole graph just add next stream - http://localhost:2480/gephi/cppdepdb/sql/select%20from%20cppfile/500000

screenshot of gephi visualization

You can also visualize traverses which sometimes can be useful.